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Abstract 

A single-index model (SIM) provides for parsimonious multi-dimensional nonlinear 
regression by combining parametric (linear) projection with univariate nonparamet- 
ric (non-linear) regression models. We show that a particular Gaussian process (GP) 
formulation is simple to work with and ideal as an emulator for some types of com- 
puter experiment as it can outperform the canonical separable GP regression model 
commonly used in this setting. Our contribution focuses on drastically simplifying, 
re-interpreting, and then generalizing a recently proposed fully Bayesian GP-SIM com- 
bination. Favorable performance is illustrated on synthetic data and a real-data com- 
puter experiment. Two R packages, both released on CRAN, have been augmented to 
facilitate inference under our proposed model(s). 

Key words: surrogate model, nonparametric regression, projection pursuit 

1 Introduction 

A single-index model (SIM) is a linear regression model with a univariate nonparametric 
link function. It provides a parsimonious way to implement multivariate nonparametric 
regression. Concretely, the SIM is represented by E{Y^|:Ej} = f(xj (3), where (3 = . . . , (3 P ) 
is called the index vector for a p-dimensional predictor variable Xj, and / is called the link. 
The product xj (3 is the index of the i th response, Y{. The parameters (3 and / (which is 
usually infinite dimensional) are estimated jointly. SIMs like these, although with random 



Gaussian predictors, were first formulated by Brillinger (1977, 1982). They have since been 



applied widely in areas such as econometrics and psychometrics (Ichimura, 1993). 



SIMs are a special case of projection pursuit regression (PPR, Friedman and Stuetzle 



1981 Hastie et al. 2001, Section 11.2), which uses M projections and links (called ridge 



functions): E{Yi\xi} = J2m=l /' 
estimation of a nonparametric /, 
involving layers of greedy forward steps, backfitting and cross validation (CV). Although the 
resulting fit may indeed yield high predictive power, it is often uninterpretable. As a result, 



xj (3 m ). Although increasing M provides greater flexibility, 
m can become difficult. Furthermore, inference can be ad-hoc, 
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many authors actually prefer SIM models because the inference is simpler, the interpretation 
is straightforward, and the properties of the estimators are well understood. 



This paper promotes a particular SIM as an emulator for computer experiments ( Santner 



et al. 2003). An emulator is a nonparametric nonlinear predictor for the output Y of com- 
puter simulations run at a design of input configurations X. The typical choice is a Gaussian 
process regression (GPR) model. Emulator parameters (e.g., those for a GPR) often have 
no interpretation. Predictions are made by integrating over their posterior distribution, so 
that decisions based on them incorporate uncertainty in the model fit. Examples include 



sequential design by active learning (Gramacy and Lee, 2009), calibration ( Higdon et al. 
2004), and optimization (Gramacy and Taddy, 2010 )|^J One reason for promoting SIMs as 



emulators is that considerable insights may be gleaned by directly studying the 
quantities, comprising of a projection and indices. 

Most of the literature on SIM inference is frequentist 



nuisance 



-see 



Antoniadis et al. (2004) for a 



nice review. Until very recently, the limited Bayesian work on SIMs focused on using splines 
for the link, / (Antoniadis et al. , 2004; Wang, 2009). The impact of that work is two- fold. It 



represents a new direction in inference for SIM models which, as illustrated empirically, can 
offer improvements over the classical approach. It also suggests a simple way that (Bayesian) 
spline models, which are widely used for univariate nonparametric regression, can be used 
in the multivariate setting where their successes has been far more limited. 



Choi et al. (2011) suggested that a GP prior be placed on /, leading to a so-called GP- 



SIM. The motivation and impact of such a model is not immediately clear. Unlike splines, 
GPRs already scale naturally from one to arbitrary dimensions without projection. So this 
poses the question: what is gained from the GP-SIM approach (either in the SIM context or 
as an emulator for computer experiments where direct GPRs are commonplace)? Choi et al. 



(2011) showed, by simulation, that the GP-SIM reduce predictive error over the canonical 



GPR (with an isotropic correlation function) in some cases, e.g., on synthetic data generated 
within the SIM class, and on a standard real-data benchmark. However the MCMC algorithm 
is much more complicated than SIM with splines or the canonical GPR. 

Much of this extra effort is unnecessary. There is a simpler formulation of the GP- 
SIM which is comparable in computational complexity to both SIM with splines and the 
canonical GPR, and works just as well if not better. Our primary contribution lies in 
communicating this reformulation and its benefits, and promoting the (new) GP-SIM as 
an emulator for computer experiments. We contend that the new formulation is easier to 
implement and portable to more exotic modeling frameworks. In this way our work can be 



seen as a honing of the Choi et al. (2011) approach, and as a subsequent generalization and 



application to an important class of multivariate nonparametric regression problems, namely 
computer experiments. Finally, we provide software as modifications to two existing R ([R] 
Development Core Team 2009 ) packages available on CRAN. Together these support every 



feature discussed herein, including all but one suggested extension. 

The remainder of the paper is organized as follows. In Section [2] we review the GP-SIM 



1 These extend earlier versions by Seo et al. (2000), Kennedy and O'Hagan| ( [200T| ) and Jones et al. (19981, 
respectively, which used point estimates to make the fit. 
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hierarchical model formulation of Choi et al. (2011). We then introduce our simplifications, 
a reformulated model, and a more efficient Monte Carlo inference method. In Section [3] we 
provide some illustrative and comparative examples on synthetic data, essentially extending 
the Choi et al. (2011) results (with the new formulation) to include a comparison to GPRs 
with a separable correlation — a stronger straw man. In Section [4] we provide a detailed exam- 
ple of a real-data computer experiment involving computational fluid dynamics simulations 
of a rocket booster. We conclude in Section [5] with a discussion of extensions including a 
worked example of a treed version of the GP-SIM on the rocket booster experiment, and 
applications to sequential design, optimization, and classification. 



2 Hierarchical model and MCMC inference 
2.1 Original formulation 



The Bayesian hierarchical formulation of the GP-SIM described by Choi et al. (2011) is 
provided below. We have changed the presentation/notation from its original version to make 
some particular points more transparent and to ease the transition to our new formulation. 

Y^Pjy-NUixlP)^ 2 ), for 2 = 1,..., n, 

P ~ II, subject to = f3 T /3 = 1, r 2 ~ IG(a T /2, b T /2) (1) 
f\9,a 2 ~GP(9,a 2 ) = f(Xp\9,a 2 ) ~Ar n (0,a 2 K(Xfc9)), 
9~G(a e ,b e ) and log(a) ~ JV(-1, 1) 

G is the gamma distribution, IG the inverse-gamma distribution, M the normal distribution, 
and a T , b T , ae, be are known constants. The rows of the design matrix X contain xj , . . . x^ . 
The prior distributions for / and /3 require some explanation. 

For / notice the lack of explicit conditioning on X/3 even though its presence is helpful for 
thinking about the relevant finite dimensional distributions. The equivalence (ee) illustrates 
what the GP prior implies when X/3 is supplied as an argument. In shorthand we have 
/„ ee [f(x{/3), . . . , f(Xn(3)] T ~ A/"„(0, a 2 K n ), i.e., a n-variate normal prior distribution with 
zero mean and n x n correlation matrix K n ee K(Xf3] 9) which has (i, j) th entry 

K(xjp, x]fc 9) = exp I ~ * 3 ' \ ■ (2) 

In other words, the GP prior uses a Gaussian correlation function, with length-scale (or 
range) parameter 9, applied to the projected indices Xj3. 

For (5 the important part isn't the choice of prior distribution II, but rather the constraint 
that it lies on the unit p-sphere. The explanation is that we are jointly modeling / and /3 
which interact as f(X/3), so is only identifiable up to a constant of proportionality. In 
practice II is chosen to be uniform, but a more flexible Fisher-von Mises prior distribution 
(which has a built in unit-sphere constraint) can be used if prior information is available. 
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Observe that the model leaves the sign of (3 unidentified since —(3 leads to the same likelihood 
under either specification — an issue that is glossed over by previous Bayesian treatments of 
SIMs. Importantly, there is posterior consistency of /3 under this setup (Choi et al. 2011). 



Inference by Monte Carlo 

Sampling from the posterior distribution of the parameters proceeds by Markov chain Monte 
Carlo (MCMC), specifically by Metropolis-within-Gibbs by iterating through full condition- 
als for f n , a 2 , (3, 9, then r 2 . The actual expressions for the conditionals are not reproduced 
here. The first two are standard distributions (N n and IG, respectively) yielding Gibbs 
updates, whereas the latter three require Metropolis-Hastings (MH). 

The f3 parameter is treated as a single block, and the proposals come in a random- 
walk (RW) fashion using a Fisher-von Mises distribution whose modal parameter is set 



to the previous value (Antoniadis et al. 2004). We add that, in this context, the lack of 



identifiability of the sign of (3 is akin to the label-switching problem for MCMC inference of 
(Bayesian) mixture models. In Appendix [A] we offer some post-processing suggestions for 
resolving the "labels" (signs) of the two "clusters" (modes of the marginal posterior posterior 



for (3). We presume that 9 and a 2 use RW-MH, but Choi et al. (2011) do not provide these 
details. Sampling f n \9, a 2 , a vector of n latent variables, proceeds via the well-known kriging 
equations discussed in more detail for the reformulated model to follow. 

2.2 Reformulation 

Our first important observation has to do with the nature of (the lack of) identifiability of 
/3 without unit ball restriction. In the particular case of a GP prior for /, through K n in 
Eq. (|2j), it is easy to see why it can only be identified up to a constant. The model is over 
parameterized. The quantity /3/y/O is identifiable (up to a sign). Or, in other words, if we 
remove 9 from the model (or fix it to one) and free /3 from its unit-sphere restriction, then /3 
is identifiable up to a single sign, —/3 or /3. In many cases, like in computer experiments (and 
mixture models), it does not matter at all (as long as the MCMC explores all possibilities) 



since (3 is not of direct interest. In others, e.g., where variable selection is important (Wang 



2009), inability to identify signs poses no real issue. When inference for (3 is a primary goal, 
then identifiability is, of course, important. However, we show how some simple heuristics 
for reconciling the signs [Appendix [A] does allow extra explanatory power that is uncanny 
in this context, through the indices X(3. 

Supposing we effectively remove 9 from the model, our second important observation is 
that (3 ought to be treated as a parameter to the correlation function (J5]), which otherwise 
would have none left. Our proposal is to re-interpret the correlation function as 

K*(xi,Xj] (3) = exp { — (xj j3 — Xj/3) 2 } = exp{ — (xi — Xj) T /3/3 T (xi — Xj)}. (3) 

This is a special rank-1 case of a Gaussian correlation structure with an inverse length- 
scale matrix E = /3/3 T . So we have a convenient re-interpretation of the GP-SIM. It is just a 
canonical GPR model with an odd correlation function. Therefore, an equivalent hierarchical 
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model to (JIJ, combining the GP prior and the IID normal likelihood are combined into a 
single expression for Y = (Yi, . . . , Y n ), is 

Y\X, /3, a 2 , V ~Ar n (0,a 2 K n ) 7 with K n = K(X; (3, r,) (4) 
/3~n, a 2 ~IG(a ff /2,6 ff /2), 77 ~ G(a„, &„). 

K(X;f3,7)) is the nugget-augmented correlation function K(xi, xf f3, rj) = K*(x i ,Xj\f3)+r]s ij , 
where K* is given in Eq. ([3]). The parameter 77 is called the nugget parameter. This nugget- 



augmented correlation is known (e.g., Gramacy 2005 Appendix B) to be equivalent to one 



having two separate variance terms (r 2 ,cr 2 ) since rj = t 2 /cx 2 and is preferable when using 
MCMC, as described below. 

Now the first line in Eq. Q is equivalent to the Yi and / lines in Eq. ([I]). The rest, 
i.e., the prior distribution s, are slightly different. We are free to choose whatever prior 
distribution, IT, we wish for /3, but note that a Fisher-von Mises prior distribution would not 
be recommended because this would severely constrain the new model. If prior information 
is available, then any sensible way of encoding it suffices. Our default is \(3j\ ~ G(ap,bp) for 
particular choices of ap and bp, however multivariate normal (MVN) prior distributions on 
f3 may be sensible when prior information is available. In either case, we assume that the 
design matrix X has been pre-scaled to lie in the unit p-cube. This makes choosing sensible 
defaults for ap, bp, a v , and b v much easier [see Section p$J. 

The benefits of this new formulation may not, yet, be readily apparent. There is only one 
fewer parameter. But we can obtain more efficient inference by MCMC since we have elim- 
inated 0{n) latent variables, f n . Our setup suggests implementing the GP-SIM as a GPR 
with a new correlation function, so its implementation (given existing GPR code) is trivial, 
and thus is ripe for extension [see Sectional. Before turning to details of inference and im- 



plementation we remark that the posterior consistency result provided by Choi et al. (2011) 
applies in our reformulated version. To see why, observe that any continuous distribution 
on (3 can be decomposed into a distribution on \\(3\\ and another on For example, 

(3 ~ N(0,I) is the same as \\(3\\ 2 ~ x 2 an d uniform on the sphere. In short, the 

models are essentially identical, and so are their properties. We prefer slightly different prior 
distributions (mainly because of defaults in existing GP software), and these do not mate- 
rially change the nature of the posterior distributions. The key is that our re-interpretation 
allows for a simpler inferential approach. 

Inference by Monte Carlo 

An advantage of the nugget-augmented correlation function is that the a 2 parameter can 
be integrated out analytically in the posterior distribution. This means it is never needed 
directly, even in the predictive equations to follow. We obtain the following marginalized con- 
ditional posterior distribution for any K, which in our particular GP-SIM case is K(X; (3, rj): 

(6 CT /2)^r[K + n)/2] (K + YK-'Y 



p(K\X, Y) = p(K) x — k " J x ( g " ) ■ (5) 



5 



See Gramacy (2005, Appendix A. 2) for a full derivation in a slightly more general setup. 
The quantity p(K) is a stand-in for H(0) x G(r); a v , b v ) and K n is the n x n matrix implied 



by K(X;/3 1 rj). The Jeffreys' prior p(o~ ) oc 1/cr , i.e., choosing (a a ,b a ) = (0,0), is preferred 
when there is no prior information about the scale of covariance. It may be used as long as 
n > 1 and leads to a simplified expression for p(K\X, Y) upon taking T[0] = 1 and 0/0 = 1. 

The significance of this result is that MCMC need only be performed for K via and r). 
So we only need establish a chain for p + 1 parameters compared to n + p + 3 parameters 
in the original formulation. The time required for each MH round is unchanged relative to 
the original formulation at 0(n 3 ) to decompose K n for each newly proposed 0. However, by 
avoiding the unnecessary sampling of n latents in each round, which is 0(n 2 ) [see Eqs. 
below], we not only save (slightly) on computational cost, but also (significantly) reduce the 



Monte Carlo error of the MCMC by having a lower dimensional chain [see Section 3.1 



In the remainder of this section we outline how our MCMC scheme further deviates from 



Choi et al. (2011), and comment on some computational advantages that are available in our 



setup. We start with the nugget i], which requires MH. A good RW proposal is the positive 
uniform sliding window rf ~ Unif [3r//4, 4^/3]. The proposed rj may be accepted or rejected 
according to a ratio involving the proposal probabilities and p(K'\X, Y)/P(K\X, Y) with 
fixed, i.e., implementing Metropolis-within-Gibbs sampling. 

Drawing for fixed 77 can proceed similarly given a suitable proposal for 0. As compo- 
nents of can be highly correlated [see Section 3.1 , we prefer to update in a single block. 



Assuming the support of the prior distribution n for is R p , a RW-MVN proposal centered 
at is a reasonable choice, i.e., 0' ~ N p ((3,Hp). If posterior correlation information about 
is known, e.g., from a pilot MCMC run, this can be used to inform a good choice of Hp 
which can be crucial for obtaining good mixing in the Markov chain. Such tuning would be 
much harder with Fisher-von-Mises distributions in the setup of Choi et al. (2011). Note 
that it helps to reconcile the signs, or "labels" , of /3s sampled from the posterior distribution 
[AppendixjAj before using them to estimate Hp. For the pilot run, or otherwise, we find that 
E p = diag p (0.2) works well when X is pre-scaled to lie in [0, l] p . 

If an application demands that all/both "labels" of be explored a posteriori, then 
we suggest using the signs of Af p (0, Hp) to create a compound proposal: take 0' — s * b, 
a component- wise product where s ~ sign[Ap(0, E^)] and b ~ Af p (0,T,p). Our experience 
is that such random sign changes only slightly alter the MH acceptance ratio when E^ is 
tuned from a pilot run. The reason is that ours is a variation on a scheme that periodically 
tries 0' = —0, which always accepts. Observe that proposing s ~ sign[A/^,(0, E^)] requires 
calculating MVN orthant probabilities for the MH ratio. For this we recommend the method 



of Miwa et al. (2003) as implemented by Craig (2008) 



Sampling from the posterior predictive distribution is easy given a collection of and r\ 
values. The distribution of Y(x*) given X, Y, and K is Student-t with 



mean 
scale 



y(x\X,Y,K) = k^(x)K- 1 Y, 



(6) 



~2 MYV m + YK- l Y)[K{x, x) - kj{x)K^k n {x)] 
a{x\X,Y,K) = — , (7) 



a a + v 
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and degrees of freedom v = n — 1, where k^(x) is the n- vector whose i th component is 
K(x, xf, (3, rj). These equations, which are a minor extension of the classic kriging equations, 
are extremely versatile. They can be used to obtain samples from the posterior predictive 
distribution; to obtain average mean-and-quantile posterior summaries; or as a basis for 



sequential design [see Section 5.2 . We can even sample jointly from the predictive at a 
design of multiple new locations X* to obtain predictive sample paths via a multivariate 
Student-t distribution derived by simple matrix extensions of Eq. (J6^[T]) . All of these would 
be extremely difficult under the original formulation. 

We have so far been leveraging the GPR-only formulation of the GP-SIM model, trying 
to forget its roots as an index model. However, we can still obtain a sample of the indices by 
simply collecting samples of x* T (3 at any location(s) x*. As illustrated in Sections 3.1 and|4j 
this can be useful for assessing goodness of fit and add explanatory /interpretive power even 
when aspects of these quantities are technically not identifiable. 



3 Implementation and illustration 



Here we illustrate our implementation(s) of the re-formulated GP-SIM and on synthetic data, 
turn to real data from a computer experiment in Section |4} We primarily follow Choi et al. 
(2011) and compare GP-SIM to canonical GPRs, using isotropic and separable Gaussian 
family correlation functions. Specifically, 



v 



k=l 



Ok 



+ Vs. 



Or. 



allows different length-scale parameters in 
= p . Choi et al. (2011) did not include a 



K(xi,Xj\0,T]) = exp 

This is the separable case, where — (0i, . . 
each dimension. The isotropic case fixes 9\ 
separable comparator, which (as we shall see) is superior for multivariate regression. 

All experiments used the tgp package defaults for prior and proposal distributions unless 
explicitly stated otherwise. [Section [5] contains the software details.] We did not propose 
random sign changes for (3. Rather, we initialized the MCMC with /3j = 1/2 and allowed the 
chain to explore the mode around one of the two labels. The results are very similar with a 
compound proposal mechanism involving sign changes but requires a more careful applica- 
tion of the methods described in the Appendix to reconcile the signs and make intelligible 
diagnostic and descriptive plots like the ones shown below. 

Our quantitative metric of comparison between predictors is Mahalanobis distance (Mah), 
as proposed by Bastos and O'Hagan (2009). It is similar to RMSE but allows for covari- 
ance in the predicted outputs to be taken into account. For a vector of responses y = 
(y(xi), . . . y(xN)) T at N hold-out predictive locations, the distance is given by Mah(y; /x, E) = 
(y — ;u) T S _1 (?/ — fi), where /i and £ are estimates of the predictive means and covariances 
for the N locations y. The distance can be interpreted as an approximation to the (log) 
posterior predictive probability of y. 
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3.1 Synthetic data in the SIM class 



Consider data generated within the SIM class. The function of the index, t, is periodic: 

. fnt\ 1 /4ttA 
/« = 8 m(-] + -eos(-J. 

The data are observed as Y i: ~ Af(f(xJ/3), 0.1 2 ) with 4- dimensional predictors X{ and = 
(2.85,0.70,0.99,-0.78). In a Monte Carlo experiment we simulated design matrices with 
n = 45 rows uniformly in [0, l] 4 , and then conditionally sampled 45 responses thus forming 
our training set. We similarly simulated a testing design matrix of 200 rows, recording the 
corresponding true (no-noise) response for making predictive comparisons via Mahalanobis 
distance. This was repeated 100 times, each time fitting the three models in question, sam- 
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Figure 1: Square- root Mahalanobis distance results for data generated in the SIM class in 
terms of boxplots (a) and a numerical summary (b). 

pling from their respective predictive distributions, and calculating Mahalanobis distances. 
The results are summarized in Figure [TJ and the ranking they imply is no surprise. What 
is particularly noteworthy is the rarity of deviation from this ordering in all 100 repetitions. 
The GP-SIM always had a lower Mahalanobis distance than the separable GPR which was 
itself better than the isotropic GPR 83% of the time. In fact, observe that the worst GP-SIM 
distance is better than the best of the other two. 

Figure [2] (a)-(d) displays, for one realization, the data, the true function, and predicted 
means for the 200 test points as functions of the true index. The increasing amount of 
scatter in (b) to (d) clearly explains the performance ordering seen in Figure [TJ We can also 
plot the predicted mean versus the fitted (posterior mean) indices. See Figure [2] (e). The 
reduction in scatter between (b) and (e) is due to the change in the horizontal axis. By 
plotting against the posterior mean index in (e), variation in predictions due to uncertainty 
in estimation of the link is masked. The posterior 95% predictive credible interval (CI) of 
the response, which is also plotted as a function of the estimated mean indices, provides 
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(a) (b) 




01234 01234 
true index true index 

Figure 2: Training data (a) and posterior predictive mean values under the GP-SIM (b), 
separable GP (c), and isotropic GP (d) plotted versus the true index. The solid line in 
each plot indicates the true index-response relationship. Panel (e) plots the GP-SIM with 
estimated indices. 

a look at the advantage of the fully Bayesian approach. Ideally, we would like to see the 
posterior uncertainty in the indices in the plot as well, but variability on multiple axes is 
hard to visualize (although some of the uncertainty in the posterior mean indices can be 
seen from the horizontal jitter of the dots). Instead, we prefer to show the variability of the 
indices implicitly though posterior uncertainty in (3. But before doing so, some comments 
on how we obtain the fitted indices are in order. 

First, the unidentifiable sign of (3 can cause lack of identifiability in the sign of the indices. 
This was easy to correct since all of the signs of the components of that were sampled 
by MCMC were the same — we were only exploring one mode — so none of the heuristics 
from Appendix |A| were needed. Therefore no adjustments to the indices were needed either. 
Second, since the length-scale 9 is built-in to (5 the range of the indices will have a different 
scale than the true indices. This is harder to fix, but it isn't really necessary — you can get 
a nice plot of the index- versus- response relationship without adjusting the scale [see Figure 
|2]^e)]. Incidentally, if the data-generating (true) (3 is not on the unit-sphere (and it is not in 
this example) then the same scaling problem arises under the original model formulation. 

The posterior correlation matrix we obtained for /3 is shown in Table [T} Non-negligible 
correlation between the components of means we can expect to obtain lower MC error by 
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Table 1: Posterior correlation matrix for /3, with variances in parentheses. 



using the above estimated values as for future runs. In a second run of the MCMC (of 



equal length) using this new proposal covariance we obtained an effective sample size (Kass 



et al. , 1998) of more than seven times that of the original chain. This improvement would 



not have been possible with a Fisher- von Mises proposal. 




b2 b3 b4 0.5 1.0 1.5 2.0 

theta 



Figure 3: Boxplots (a) collecting normalized samples from the posterior distribution for ft, 
and a histogram (b) showing the implied values of 9 based on the normalization. 

Figure |3](a) shows the resulting samples, normalized to lie on the unit sphere. Observe 
that the posterior density is very tight around the true (normalized) index vector. The nor- 
malizing constant of each ft sample implies a sample of the phantom 9 length-scale parameter 
via a reciprocal and square root. A histogram of these values is shown in panel (b). 



3.2 The borehole data: a deterministic function 



An example that contrasts with the previous one is the borehole function (Worley, 1987), as 

(9) 



studied by many authors (e.g. Morris et al. 1993). The response y is given by 

2nT u [H u - m 



log 



1 + 



2LT U 



log(r/r TO )r-2 K n 



+ 
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The eight inputs are constrained to lie in a rectangular domain: 



r w G [0.05, 0.15] r G [100, 5000] T u G [63070, 115600] T x G [63.1, 116] 
H u G [990, 1110] H[ G [700, 820] L G [1120, 1680] K w G [9855, 12045]. 

We offer some experimental results on data obtained from the borehole function (without 
noise) in order to highlight how the GP-SIM compares to other GPR models when the 
data-generating mechanism is far outside the SIM class. Approximating Eq. ^ with a 
transformation of a linear combination of the eight predictors will be crude in comparison 
to many alternatives. Note that we still fit the GPRs/GP-SIM with a nugget even though 



Gramacy and Lee 


( 


2011 


)■ 


l (LHD, Santner et al. 




2003 



Section 

5.2.2) constrained to the above rectangle, and obtained 250 responses, y. A hold-out testing 
set of size 1000 is similarly obtained, and after fitting the GP-SIM and canonical GPRs as 
in Section [3~T it is used to calculate Mahalanobis distances to measure predictive accuracy. 
This is repeated 100 times, generating 100 distances for each of the three methods. A pilot 
run was used to determine MH proposals for /3, which was subsequently used throughout. 
This initial run also indicated that the marginal posterior distributions for the (5 coefficients 
corresponding to r, T u and T\ was tightly straddling zero, prompting us to discard these 
predictors from the model. 
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Figure 4: Square-root Mahalanobis distance results for the borehole data in terms of boxplots 
(a) and a numerical summary (b). 

The results are summarized in Figure |4j Briefly, we see that the GP-SIM model is 
competitive with the other GPRs on this data. On average (and 90% of the time) it is 
better than the GPR with an isotropic covariance function, but worse than the GPR with 
a separable one (also 90% of the time). So projecting onto the index is helpful before 
measuring correlations with a single length-scale parameter, but having a separate length- 
scale parameter for each input direction is more effective. Figure [5]^a) shows the posterior 
distribution of the responses (mean and 90% CI) as a function of the posterior mean indices, 



11 




Figure 5: Panel (a) shows the posterior mean predicted values, and 90% CI, versus the 
posterior mean fitted indices; panel (b) shows the posterior of the index vector. 



describing the contribution of the projection aspect of the GP-SIM. Panel (b) shows the 
normalized estimates of (5. As in the previous example, the components of (5 from the 
MCMC had identical signs throughout, so no post-processing was needed. 

In practice, one rarely knows the functional form of the data-generating mechanism inti- 
mately enough to know a priori whether an SIM structure (as in Section 3.1 ) or a separable 
structure (as in the current example) is best. It is therefore comforting to see that the GP- 
SIM does not perform arbitrarily badly with data that are (almost pathologically) outside 
of the SIM class. In the next section we show, by example, that computer experiments can 
benefit from the estimation of SIM structure to a surprising degree, especially when one of 
the inputs plays a predominant role in predicting the response. 



4 Emulating a real computer experiment 

To try out the GP-SIM as an emulator for a real computer experiment we turn to a set 
of computational fluid dynamics (CFD) codes that simulate the characteristics of a rocket 
booster, the Langley glide-back booster (LGBB), as it is re-entering the atmosphere. For 



previous uses of this data, and further details on the experiment, see Gramacy and Lee 



(2009). The simulations calculate six aeronautically relevant responses as a function of three 
inputs that describe the state of the booster at re-entry: speed measured in Mach; angle 
of attack a; and side-slip angle (3, both measured in degrees. We shall begin with the roll 
response for a detailed analysis, and revisit the other five responses later. There are 3014 
such quadruplets in the portion of the data set we are concerned with. Peculiar irregularities 
(or features) in the relationship between the inputs and the response, like input-dependent 
noise and regime changes, pose challenges for constructing a good emulator and make this 
experiment interesting. 

First, we wish to see how the GP-SIM measures up against the canonical GPR models. 
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Towards this end, we set up an "inverted" CV experiment wherein we partition the data into 
10 nearly equal-sized folds. Then we iterate over the folds, training the models on the 10% 
block of data inside the fold, and obtaining samples from the posterior predictive distribution 
on the remaining 90% outside the foldfl Since we do not know the true responses at the 
held-out test locations — only the simulated values from the CFD codes are available — the 
variance/covariance aspects of the Mahalanobis distance matrix will play a major role in the 
comparison. Rather than simply penalizing fits which poorly predict a few observations, the 
covariance term acknowledges the model's "explanation" that they are noisy. 
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Figure 6: Square-root Mahalanobis distances for the LGBB roll response in terms of boxplots 
(a) and a numerical summary (b) for numerical specificity. [See Section 5.1 for an explanation 
of the "t-" results.] Observe that y-axis of the boxplot clips some of the outliers in order to 
improve visualization. 



We applied this inverted-CV procedure ten times, randomly, for 100 total folds generating 
100 Mahalanobis distances for GP-SIM and the two GPR models. The results are summa- 
rized in Figure [6] The figure is actually summarizing the results from two experiments, the 



second of which is described later in Section 5.1 For now we focus on the part summarizing 



fits for "iso", "sep", and "sim" comparators (the left part of the plot, or the top part of the 
table). All three versions have similar mean Mahalanobis distances, although the GP-SIM 
model was the best on average. What is particularly striking from the boxplots is that the 
variance of the GP-SIM distances is much smaller than the others, indicating a much more 
reliably good fit. Apparently, projecting the inputs onto a single index, and measuring spa- 



2 We avoid standard CV since the computational demands (required to invert large matrices) would be 
too large for a Monte Carlo experiment. 
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tial correlation on that scale, is better than estimating axis-aligned spatial covariation (the 
separable GPR) or isotropic spatial correlation on the original inputs. 



(a) 

roll response 



(b) 





1 2 
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alpha 



Figure 7: Panel (a) shows the posterior mean predicted values versus the posterior mean 
fitted indices; panel (b) shows the posterior of the index vector. 

Aspects of the GP-SIM fit are shown in Figure The curves in panel (a), showing the 
posterior predictive means and 90% CIs versus the posterior mean fitted indices, go some 
ways towards explaining why the GP-SIM has lower Mahalanobis distances compared to the 
other GPR models. The non-trivial shape and overall smoothness of the estimated index- 
response relationship suggests that the single index explains a great deal of the variation in 
the data. An exception might be for indices in the range (1.5,2.5). Here disparate inputs, 
with similar but distinct input/output relationships, are mapping to nearby indices and the 
single index structure is struggling to cope. A modification to accommodate nonstationarity 
of the response may help [see Section 5.1 . Panel (b) in the same figure shows the posterior 



distribution of the components of the index vector, suggesting a reasonable fit with low MC 
error. The low variance on the /3 components with posterior mass far from the origin suggests 
that all three inputs are relevant predictors. 

The other five responses exhibit broadly similar behavior. The summary of the inverted- 
CV Mahalanobis distances are nearly identical to those obtained for the roll response, so they 
are not duplicated here. It is revealing to look at the relationship between the estimated 
indices and the response, and the corresponding samples from the posterior of the index 
vector, for these other five responses. See Figure [8j The samples of 9 were similar to the 
roll case and so they have been omitted. Some brief comments are in order. The index- 
response relationship in the lift and drag responses is rather tame, and we can see that third 
component of (3, the side-slip angle (also called (3), is likely not a relevant predictor for 
these responses — it tightly straddles zero. A more formal variable selection analysis may be 



warranted (Wang, 2009), but is probably overkill in this particular case of three inputs. The 



pitch index-response relationship bears some similarity to that for roll, and like roll all of 
the components of the index vector seem to be significant. None of these inputs lead to quite 
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Figure 8: Left column shows the posterior predictive means and 90% CI versus the posterior 
mean indices; right shows the posterior of the index vector. The rows are lift, drag, pitch, 
side, and yaw. 
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the same multi-modality of the index-response relationship as in the roll output, suggesting 
that this case is the most challenging of the six, and consequently the most likely to benefit 
from a nonstationary approach to modeling GP correlations. 

To sum up, not only is the GP-SIM a better emulator for this data than the canonical 
GPRs, but it offers scope for interpretation and analysis that is uncanny in the context of 
computer experiments, specifically, and GP models generally. 



5 Discussion of extensions 

Our reinterpretation of the GP-SIM as a GPR model with a rank-1 covariance function 
means that the SIM is, now, extremely modular. The following sub-sections suggest how it 
can be trivially embedded into a number of different environments which either extend or 
generalize the model, or enable it to be used in a new context. With the exception of the last 
one [Section 5.4 , all of these extensions are already implemented in one of the R packages, 



tgp (Gramacy 



2007 Gramacy and Taddy, 2010) or plgp (Gramacy and Poison, 2010) 



both on CRAN. In tgp the GP-SIM functionality is invoked by supplying the argument 
cov = "sim" to the bgp fitting routine. In plgp you specify cov = "sim" in the prior. GP 
function. So these are more than just suggestions. The triviality of the changes required to 
the packages to add in the GP-SIM functionality, essentially adding a few extra routines to 
implement a new covariance function, is a testament to its newfound modularity. 



5.1 Treed GP-SIM 

Some of the challenging aspects of the LGBB experiment, particularly poor fits from sta- 
tionary emulators, motivated a new class of models called the treed Gaussian process (TGP, 



Gramacy and Lee, 2008). The idea is to use a Bayesian tree model (Chipman et al. 2002) 



to infer a partitioning of the space so that independent GPR models could be fit in differ- 
ent locales or regions of the input space, thus facilitating nonstationary, input-dependent, 
modeling. The result was a much better emulator for the LGBB simulations and faster joint 
tree-GP MCMC inference compared to GPs alone due to the divide-and-conquer nature of 
trees. The tgp package was built to implement this new methodology. The implementation 
of canonical GPR models that we have been using so far comes as a convenient special case. 

Since the GP-SIM is just a special GPR model it enjoys the tree extensions as well, giving 
rise to the TGP-SIM. To entertain the possibility that the TGP-SIM might improve upon 
the original TGP model we re-ran the experiment from Section [4] using trees. The results are 
presented in Figure [6] under the headings "t-iso" , "t-sep" , and "t-sim" . All three methods 
benefit from the (axis-aligned) treed partitioning, having lower Mahalanobis distances than 
their non-treed counterparts. This is a testament to the value of the treed partitioning for 
this data. Apparently, projections are the best way of modeling spatial correlation within 
the treed partitions, leading to a similar ordering as for the non-treed results. Unfortunately, 
the interpretation aspects of the GP-SIM model — of inspecting how the estimated indices 
relate to the estimated response — do not so easily translate to the treed version. So while 
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the fit is better with trees, the interpretation suffers, as one might expect from a (much) 
more nonparametric model. 



We also tried the TGP-S1M model on the simple sinusoidal data [Section 3.1 and the 
borehole data [Section |3~2 but they lead to no improvement. The tree never partitioned the 
input space in either case, so the TGP-SIM model reduced to the GP-S1M model. There 
was also no partitioning for the TGP-GPR models, so they reduced to canonical GPRs. 
Finally, partitioning is a natural way to handle factor /categorical inputs. When paired 
with GPRs the result is a flexible nonparametric model for regression functions with mixed 
categorical and real- valued predictors (Broderick and Gramacy, 2010). Therefore the tgp 



implementation represents the first application of SIM models to mixed inputs [see Gramacy 



and Taddy (2010, Section 2) for more details]. 



5.2 Sequential design of experiments 

An important aspect in computer simulation, and consequently an important application 
for GPs, is in the designing of the experiment. Obtaining each response Yj at input x,- t can 
involve running a time consuming computer simulation on an expensive machine. There is 
thus interest in minimizing the number of simulations and subsequently extracting the most 
information from the experiment. A common approach is to design sequentially, using the 
current model fit to suggest new (x, Y) pairs by active learning heuristics. The fit is then 
updated based on the output of the simulation(s) at the new location(s); and repeat. 

Depending on the goal of the experiment some heuristics are more appropriate than oth- 
ers. For example, if maximizing understanding about the (x, y) relationship is the ultimate 
goal, then statistics based on the predictive variance or expected reduction in predictive 
variance work well (Seo et al. , 2000). The former is directly available ([7]) and the latter is 
analytic given the GP parameterization. In sequential applications with stationary GPR 
models the result is a variation on a space-filling design. If the relationship is harder to 
predict in some parts of the input space than others, then a model like TGP is more ap- 
propriate, and the resulting sequential designs thus constructed can be far from space-filling 



(Gramacy and Lee, 2009). The modularity of the GP-SIM means that it is easily applied in 



these contexts. The tgp package implements both types of active learning heuristic and is 
agnostic about the type of correlation (e.g., a rank-1 SIM correlation can be used). 

If optimization — finding x minimizing Y(x) — is the goal, then a statistic called expected 
improvement (EI, Jones et al. 1998) is appropriate. This quantity is also available analyt- 

Calculations of EI, 



ically as a function of the GPR predictive equations (|6j-[7j). Calculations of EI, and some 
generalizations, are supported by both tgp and plgp thereby further extending the appli- 
cability of the GP-SIM and its treed version. For more details on EI in the tgp package, 
see Gramacy and Taddy (2010, Section 4). A particular generalization of EI for constrained 



optimization, with the help of classification models and plgp, is discussed below. 
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5.3 GP-SIM for classification 



GP models are also popular for classification (GPC). By using a GPR model as a prior 
distribution for latent variables which feed into a soft-max function (i.e., an inverse logistic 
mapping) a "likelihood" for categorical responses is implied. See Neal (1998), for details. 



Since the GP-SIM is just a special case of a GPR this means that GP-SIMs can be similarly 
applied for the prior distribution on the latent structure, and thus SIMs may be used for 



classification too. GPCs are implemented in the plgp package; for details see (Gramacy and 



Poison, 2010). Treed GP classification (Broderick and Gramacy 2010) is also possible with 
the SIM covariance, via similar extensions J^J 

A knock-on effect of trivial GP-SIM classification models is that they can be used in tan- 
dem with GP regression models (including SIM) to solve the sequential design problem of 



optimization under unknown constraints. Gramacy and Lee (2010) developed an algorithm 



that leverages a statistic called integrated expected conditional improvement by combining a 
global improvement statistic derived from a GPR posterior with the probability of satisfy- 
ing a constraint from a similar GPC posterior. This two-pronged approach was illustrated 
on a constrained optimization problem arising from computer simulations in health policy 
research. The calculation of these statistics for all GP combinations, including SIM, is im- 
plemented in the plgp package. As above, invoking this new functionality is simply a matter 
of specifying cov = "sim" to a function called prior . ConstGP. In a purely classification 
context a natural design heuristic is the predictive entropy, which is also supported for the 
CGP-SIM model in the package. 



5.4 Multiple-index models 



As a natural extension of the SIM, a multiple-index model (MIM, Xia, 2008) assumes that all 



information in the regression function provided by X is contained in k linear combinations of 
the columns of X, that is, E{yj|xj} = f(xjB) with p x k index matrix B. For identifiability, 
the constraint B T B = Ik is often imposed. Using our new formulation, it is easy to see 
that the hierarchical structure ^ need not change much to implement a GP-MIM. The only 
substantive difference is that the correlation structure (J3]) would now have an inverse length- 
scale parameter of X = BB T , a rank-fc matrix. It is similarly possible to dispense with 
the orthogonality constraint although B is then only identifiable up to an orthogonal k x k 
matrix. Clearly the MCMC becomes more involved with a higher dimensional parameter 
like B, necessitating even more care in the design of RW proposal mechanism. Inference for 



the rank, k, may be facilitated by reversible jump MCMC (Green, 1995) in low-p settings 



Finally, we note that extending GP-SIM and GP-MIM to employ correlations function other 
than the Gaussian case ^ is also possible. 



3 The classification extensions to tgp are not on CRAN, but code may be obtained from the authors. 
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A Heuristics for reconciling the signs 

We see at least three ways of reconciling the signs, or "labels" (in mixture modeling vocabu- 
lary), for the index vector, i.e., or — 0. The first involves looking at the posterior projected 
sample indices, whereas the latter two work with the samples of from the posterior directly. 
We note that if the GP-SIM model is being used solely as a predictive model then there is 
no need to identify the labels. But when aspects of the projection are of direct interest, 
heuristics are needed. This is true under both GP-SIM formulations discussed in this paper, 
or indeed any other Bayesian approach employing a prior distribution for on the (possibly 



scaled) unit ball (e.g., Antoniadis et al. 2004). 



Our preferred heuristic involves collecting a set of sample indices obtained at a reference 
set of predictive locations X uniformly in [0, l] p , i.e., the collection X0^\ for t — 1, . .. , T 
MCMC samples. We usually find that the average of sample indices thus obtained neatly 
cluster into positive or negative groups. The clustering implies a 2-partition of the collection 
of samples, one of which has the "wrong" sign. After "correcting" the sign of the "wrong" 
group (by negating those samples) we obtain sample indices which are on one side of zero on 
average. Once so adjusted, plots of average indices and boxplots/histograms of samples of 
become are easier to interpret (and look much like like the ones we provide in this paper). 
This technique is certainly not fool-proof. In particular, it relies on having a large enough 
MCMC sample and predictive set X so that the average indices cluster neatly. 

A simpler, but perhaps less reliable or automatic, approach involves looking at the sam- 
ples 0^> directly. If some \(3i\ ^> with high posterior probability, say \0\\ ^> 0, then 
reconciling the signs is easy. Just negate each 0^> for which 0® < 0, say. Alternatively, cal- 
culate the covariance matrix of the full sample 0^\ . . . , 0® and identify which components 
of differ in sign via negative sub-diagonal columns or rows of the matrix. 



Figure [9] illustrates these two methods on the sinusoidal synthetic data from Section 3.1 
The top row, panels (a-c), show how the method based on sample indices would play out. 
Although the sample /3s are both positive and negative [panel (a)], panel (b) shows that 
the indices cluster nicely. Panel (c) shows the implied index-response relationship, where 
colors/points indicate which points are in which cluster according the parity of the average 
indices. The bottom row, panel (d), shows the posterior covariance matrix of the samples 
of [shown in panel (a)], indicating that 04 has an opposite sign from the rest. Using this 
heuristic leads to an identical clustering to the one shown in panels (b-c). 

The last heuristic involves a different, perhaps more reliable, approach to finding a point 
estimator for given samples from the posterior. We first normalize these vectors so that 
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Figure 9: Illustrating the index-based heuristic [top row: panels (a-c)], and the covariance 
heuristic [bottom row: panel (d)]. Panel (a) shows the sampled (signed) /3s from the posterior 
via boxplots; panel (b) shows the average indices obtained from that sample; panel (c) shows 
the clustered posterior mean index-response relationship it suggests; and panel (d) shows 
the posterior covariance matrix of the /3s. 



||/?® || = 1. Because of the sign indeterminacy, we cannot use the sample average as an 
estimator. Instead, we define (3 = mxD.p.upu = i J^t=i(l — ((3 T (3^) 2 ) . For interpretation of 
this measure, note that 1 — (/3 T /3®) 2 = sin 2 0® where 8® is the angle between (3 and fa f \ 
It is equal to zero for both 0® = and 0® = ir. Minimizing £^ =1 (1 - (/3 T /3 W ) 2 ) is the 

same as maximizing Ylt=i(fi T P^) 2 = P T (J2t /3®/3® T )/3 and we easily see that f3 is just the 
eigenvector corresponding to the largest eigenvalue of the matrix fa t 'fa t > T . This estimator 
can be used on its own, or to help "choose" a set of signs for the full sample. 
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